1.  Summary


This  research  addresses  the  initial  stages  of  the  development  of  an  enabling  technology  for  DNA 
computing  and  other  biological  assay  applications.  This  work  combines  mathematics,  computer  science 
and  chemistry.  It  is  focused  on  the  construction  of  a  biomolecular  architecture  designed  to  employ  new 
algorithmic  paradigms  based  on  the  massively  parallel  computational  power  of  DNA  hybridization.  The 
ultimate  intent  is  to  develop  a  computing  basis  to  eventually  overcome  the  exponential  time  complexity  of 
many  discrete  math  problems  so  that  they  can  be  solved  in  linear  real  time.  Many  of  these 
computationally  hard  (NP)  problems  are  critical  to  logistics,  scheduling  and  security.  In  particular,  we 
made  an  initial  application  of  biomolecular  computing  methods  to  the  identification  of  important  patterns 
in  data,  i.e.,  data  mining.  Data  mining  has  important  applications  to  information  security,  assurance  and 
superiority. 

In  this  research,  we  developed  methods  of  generating  large  collections  of  single  stranded  DNA 
sequences  called  a  DNA(n,d)  code.  DNA(n,d)  codes  serve  as  universal  components  for  biomolecular 
computing.  DNA(n,d)  codes  are  closed  under  reverse-complementation.  The  strands  in  a  DNA(n,d)  code 
have  such  binding  specificity  that  a  code  strand  will  only  hybridize  with  its  reverse-complement  and  will 
not  cross  hybridize  with  any  other  code  strand  in  the  DNA(n,d)  code.  Such  collections  of  strands  are 
crucial to the success of Adleman [1] style DNA computing [4], [15], [24]. The collections also have
important  applications  in  many  other  biological  assays.  Some  of  the  other  applications  are  single 
nucleotide  polymorphism  (SNP)  genotyping  [9],  gene  expression  profiling  [5],  DNA  chip  development 
[11],  [22],  [37],  and  self-assembly  [41]. 

In  this  proposal,  we  think  of  the  strand  design  problem  as  a  mathematical  coding  theory  problem 
and  we  use  the  insertion-deletion  metric  as  our  constraint.  This  approach  has  been  previously  suggested 
[2],  [33].  It  was  initially  implemented  in  [9]  with  excellent  binding  specificity.  We  have  dramatically 
improved  the  initial  research  in  [9]. 

Two  codewords  x  and  y  are  at  insertion-deletion  distance  at  least  d+1  if  and  only  if  their  longest 
common  subsequence  has  length  at  most  n-d-1.  This  means  that  if  up  to  d  deletions  (of  sequence  entries) 
are  made  in  any  codeword  x,  the  resulting  (and  shorter)  q-ary  sequence  could  not  have  been  obtained  by 
deleting up to d entries in any other codeword y with y ≠ x. Since two 3'-5' DNA sequences x, y of
length n can form s bonds in a duplex only if x and the complement of y have a common subsequence of
length s, the insertion-deletion metric is the only metric that can model this cross-hybridization bonding
constraint.




We wrote programs that generate very good random and pseudo-random DNA(n,d) (IDq(n,d))
codes. On a 2.3 GHz Pentium PC, we generated a DNA(20,5) code of size 3038. This is a ten-fold
increase over previously published constructions. This improvement stems from the
Markov chain approach we used to generate candidate codewords.

Given two n-sequences x, y, all of our programs use a "folklore" dynamic programming
algorithm to find the length of the longest common subsequence (lcs(x,y)) of x and y. Our very first
program used a (reverse-)complement cyclic code as an initial set from which to find a DNA(n,d) code as a
subcode. However, we eventually realized that cyclic codes have too much symmetry to generate
large-size DNA subcodes. The next step was based on generating uniformly distributed independent random
n-sequences, but the volumes of the spheres of a fixed radius centered at many of these n-sequences are
too large, and we did not get the desired performance. Our most recent programs generate candidate
n-sequences x in the following way. The value of x1 is selected from {a,c,g,t} with uniform probability.
Then, the remaining entries are generated by a stationary Markov chain given by transition matrix M_k
with parameter k.
FIGURE  1


These sequences have low volume spheres of the desired radius. However, the higher the value of k, the
fewer sequences are generated. Our programs take a dynamic heuristic approach. These
programs start with a high value of k and then check all values of the permitted length of the longest
common subsequence. This continues until a set number of cycles pass without a new codeword being added.
The next value of k is set by finding the next highest value of k for which a codeword can be added to the
growing code. Here dichotomy (bisection) is used. This heuristic has worked very well and is much better
than the uniform codeword generation method. For example, by using our Markov chain heuristic, we can
generate a DNA(15,5) code of size 104. Using the uniform distribution method, we could only generate a
DNA(15,5) code of size 14.
Our DNA(n,d) codes can be applied to biomolecular computing as described in [6]. Note that
because  of  the  properties  of  our  DNA(n,d)  code,  we  need  only  half  as  many  distinct  strands  as  were  used 
in  [6].  We  show  that  a  DNA(n,d)  code  can  be  used  to  encode  and  filter.  In  [6],  a  strand  either  encodes  or 
filters  exclusively.  We  have  applied  this  approach  to  the  data  mining  problem  of  the  identification  of 
maximal  frequent  sets.  In  short,  using  a  DNA(n,d)  code  as  universal  components,  an  assembly  process 
creates  single  stranded  DNA  molecules  called  DNA  bit  strings  that  store  and  retrieve  information. 
For a given database, a library of DNA bit strings is passed through (algorithmically constructed)
filters.  Then  the  DNA  bit  strings  that  remain  represent  maximal  frequent  sets  in  the  database. 


2.  Introduction 

In  [1],  [6],  [17],  [26]  it  has  been  shown  that  the  hybridization  that  occurs  between  a  DNA  strand 
and  its  Watson-Crick  complement  can  be  used  to  perform  mathematical  computation.  The  promise  of 
DNA  computing  is  that  the  massive  parallelism  of  DNA  hybridization  reactions  can  be  exploited  to 
overcome the time complexity (on a silicon based computer) of an important class of discrete mathematical
problems  so  that  they  can  be  solved  in  real  time.  However,  to  achieve  the  full  potential  of  DNA 
computing,  many  technological  hurdles  need  to  be  overcome.  This  work  addresses  this  issue. 

In  this  research  large  collections  of  single  stranded  DNA  sequences  called  a  DNA(n,d)  code  are 
developed.  DNA(n,d)  codes  are  closed  under  reverse-complementation.  The  strands  in  a  DNA(n,d)  code 
have  such  binding  specificity  that  a  code  strand  will  only  hybridize  with  its  reverse-complement  and  will 
not  cross  hybridize  with  any  other  code  strand  in  the  DNA(n,d)  code.  Such  collections  of  strands  are 
crucial to the success of Adleman [1] style DNA computing [4], [15], [24]. The collections also have
important applications in many other biological assays [3], [4], [5], [8], [9], [11], [22], [37], [41].

Single  strands  of  DNA  are  modeled  by  directed  sequences  of  letters  from  the  alphabet  {A,  C,  G, 
T}  where  A,  T  and  C,  G  are  called  complementary  pairs.  Two  oppositely  directed  DNA  sequences  are 
capable  of  coalescing  into  a  duplex.  Because  an  A  (C)  in  one  strand  can  only  bind  to  a  T  (G)  in  the 
oppositely  directed  strand,  the  greatest  energy  of  duplex  formation  is  obtained  when  the  two  sequences 
are  reverse-complements  (a.k.a.  Watson-Crick  complements)  of  one  another.  This  annealing  process  is 
referred  to  as  DNA  hybridization.  For  example,  given  the  strand  3'GTATTGAT5'  (directed  3'  to  5'),  the 
oppositely  directed  (5'  to  3'  )  strand  5'ATCAATAC3'  is  the  reverse  complement.  Henceforth,  the  terms 
complement  and  reverse-complement  are  synonymous.  See  FIGURE  2.  Since  molecules  can  turn  over  in 
solution, our pictures are intended to capture this. Because we are accustomed to working with the bottom
strand  of  a  duplex,  the  numbering  of  our  sequences  is  3'-5'  rather  than  the  more  customary  5'-3'. 




[Figure: the strand 3'GTATTGAT5', its complement 3'ATCAATAC5', and the intended strand:complement duplex]

FIGURE  2 


Hybridization  assays  offer  the  possibility  of  simultaneously  processing  trillions  of  bits  of 
information.  In  DNA  hybridization  assays  for  biomolecular  computing,  DNA  strands  can  be  used  for 
multiple  purposes.  They  can  be  used  to  store,  write,  read  and  retrieve  information.  Hybridization  assays 
with  DNA  strands  are  also  used  to  separate,  manipulate,  identify  and  address  molecules  in  many  other 
important experiments beyond biomolecular computing [3], [4], [5], [8], [9], [11], [22], [37], [41].

In  DNA  computing  hybridization  assays,  each  strand  in  the  assay  must  hybridize  much  more 
strongly  to  its  complement  strand  than  to  any  other  strand  or  any  other  complement  strand.  In  such  assays, 
DNA  strands  are  synthesized  and  "labeled"  or  "fixed"  DNA  probes  are  allowed  to  hybridize  with  the 
synthesized  DNA  strands  in  a  controlled  and  algorithmic  way.  The  resulting  hybridized  and  "labeled"  or 
"fixed"  DNA  molecules  contain  information  (i.e.,  solutions  to  problems)  that  can  in  turn  be  "read"  by 
further  hybridization  reactions  or  other  means.  An  example  of  this  type  of  paradigm  is  the  sticker  method 
[33]. 


The  advantage  of  this  method  is  that  it  uses  universal  components  that  can  be  mass  produced.  We 
call  the  collection  of  universal  components  a  DNA(n,d)  code.  See  Definition  2.  The  main  problem  with 
this  basic  method  is  that  unintended  cross-hybridization  is  a  main  source  of  errors. 

Thus the problem is to avoid the formation of unintended duplexes in a DNA(n,d) code. In
FIGURE 3, we give some examples of how this could happen. In FIGURE 3, the pairs (R, L) and (r, l) are
complements. The only intended duplexes are R:L and r:l. Base pairing can (possibly) occur between the
red bases in the unintended duplexes listed in FIGURE 3. The other unintended duplexes not shown in
FIGURE 3 are R:l, R:R, L:L, l:L and l:l.
FIGURE  3


To  deal  with  the  problem  of  unintended  duplexes,  each  intended  duplex  should  be  much  more 
energetically  stable  than  any  possible  unintended  duplex. 

In  this  way,  if  the  hybridization  assays  are  conducted  at  a  temperature  above  a  certain  threshold, 
then  only  intended  duplexes  can  form.  The  main  open  question  is  how  to  best  mathematically  model  this 
strand  design  problem. 

The  design  problem  for  a  DNA(n,d)  code  presents  a  trade-off.  In  order  to  maximize  the  amount 
of  information  that  can  be  stored  or  processed  in  parallel,  it  is  desirable  to  have  as  many  strands  as 
possible.  On  the  other  hand,  if  too  many  strands  are  used,  similar  strands  will  entail  cross-hybridization, 
reducing the accuracy of the assay. Thus the DNA(n,d) code must be constructed to adhere to some
constraints.

This  research  improved  known  constructions  of  DNA(n,d)  codes  and  demonstrated  how  to  use 
them  as  universal  components  in  DNA  based  computing. 
3.  Methods,  Assumptions,  Procedures


3.1  Insertion-Deletion  Codes 

In this work we think of the strand design problem as a mathematical coding theory problem and
we use the insertion-deletion metric as our constraint.

While other "metric" constraints have been applied to this problem [7], the insertion-deletion
metric is the only one that models constraining the absolute maximum number of possible
cross-hybridized (i.e., bad) base pairings. Also, the insertion-deletion metric is the only one that can be
adapted to constraining the absolute maximum number of possible cross-hybridized (i.e., bad) hydrogen bonds.

All lower-case Roman variables represent non-negative integers. [n] denotes the set
{1, 2, ..., n}. A q-ary n-sequence x = (x_i) is a sequence of length n with entries x_i ∈ {0, ..., q-1} for
each 1 ≤ i ≤ n. For applications to our strand design problem we have q = 4.

Definition 1. Given two q-ary n-sequences x and y, lcs(x,y) is the maximum length of all subsequences
common to both x and y. The insertion-deletion (or Levenshtein) metric is denoted by ρ(x,y), where
ρ(x,y) = n - lcs(x,y). A collection C of q-ary n-sequences is called an IDq(n,d) code if for any distinct
x, y ∈ C we have ρ(x,y) ≥ d + 1. Let idq(n,d) be the maximum size of an IDq(n,d) code.

Two  codewords  x  and  y  are  at  insertion-deletion  distance  at  least  d+1  if  and  only  if  their  longest 
common  subsequence  has  length  at  most  n-d-1.  This  means  that  if  up  to  d  deletions  (of  sequence  entries) 
are  made  in  any  codeword  x,  the  resulting  (and  shorter)  q-ary  sequence  could  not  have  been  obtained  by 
deleting up to d entries in any other codeword y with y ≠ x. For this reason IDq(n,d) codes can be
thought of as d-deletion correcting codes (see footnote 1). See Example 1.

Example 1. Consider the 4-ary sequences x, y of length 8 with x = 31011301 and y = 22113300. Then
ρ(x,y) = 4 because lcs(x,y) = 4. A common subsequence of length four is 1130.


1  Actually such a code can correct any combination of up to d deletions and/or insertions. Hence the name
insertion-deletion metric.
Note that by starting with x and deleting the four entries outside the common subsequence, we have the
common sequence 1130. Then by inserting the entries 2, 2, 3, 0 in the appropriate places, we obtain y.
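The computation in Example 1 can be reproduced with a short sketch of the standard dynamic programming recurrence for the longest common subsequence. The function names `lcs_length` and `indel_distance` are illustrative choices, not names taken from the programs described in this report.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the best common subsequence of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last symbol of one prefix or the other.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(x: str, y: str) -> int:
    """Insertion-deletion distance n - lcs(x, y) for two sequences of equal length n."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return len(x) - lcs_length(x, y)


x, y = "31011301", "22113300"
print(lcs_length(x, y), indel_distance(x, y))  # 4 4
```

The table is filled one row at a time, so the running time is proportional to the product of the two lengths while only two rows are kept in memory.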

There are a couple of basic formulas that form the basis for the theory of IDq(n,d) codes. Let x be
a q-ary n-sequence. Let Dt(x) be the set of all q-ary (n-t)-sequences that can be obtained from x by t
deletions. Let It(x) be the set of all q-ary (n+t)-sequences that can be obtained from x by t insertions.
Let Vt(x) be the set of all codewords with insertion-deletion distance at most t from x. Vt(x) is the
insertion-deletion sphere of radius t centered at x. Let r(x) be the number of runs of x. A run is a
maximal interval of x that consists of the same symbol, e.g., 111211000033 has five runs. From [20],
[23], [24], we have that:

$$\sum_{i=0}^{t} \binom{r(x)-t}{i} \le |D_t(x)| \le \binom{r(x)+t-1}{t} \quad \text{and} \quad |I_t(x)| = \sum_{i=0}^{t} \binom{n+t}{i} (q-1)^i .$$
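For very short sequences, the sizes of the deletion and insertion sets can be checked by brute force. The sketch below assumes the run-based bounds read as sum_{i=0..t} C(r(x)-t, i) <= |Dt(x)| <= C(r(x)+t-1, t) and the count |It(x)| = sum_{i=0..t} C(n+t, i)(q-1)^i; all function names are illustrative, and the enumeration is exponential, so it is only meant for toy inputs.

```python
from itertools import combinations, product
from math import comb


def runs(x: str) -> int:
    """r(x): number of maximal blocks of equal symbols in x."""
    return sum(1 for i in range(len(x)) if i == 0 or x[i] != x[i - 1])


def deletion_set(x: str, t: int) -> set:
    """D_t(x): all sequences obtained from x by deleting t entries."""
    n = len(x)
    return {"".join(x[i] for i in keep) for keep in combinations(range(n), n - t)}


def is_subsequence(z: str, y: str) -> bool:
    """True if z can be obtained from y by deleting entries."""
    remaining = iter(y)
    return all(symbol in remaining for symbol in z)


def insertion_set(x: str, t: int, alphabet: str) -> set:
    """I_t(x): all sequences of length len(x) + t that contain x as a subsequence."""
    words = ("".join(w) for w in product(alphabet, repeat=len(x) + t))
    return {w for w in words if is_subsequence(x, w)}


def deletion_bounds(x: str, t: int) -> tuple:
    """Assumed lower and upper bounds on |D_t(x)| in terms of r(x)."""
    r = runs(x)
    lower = sum(comb(r - t, i) for i in range(t + 1)) if r >= t else 0
    upper = comb(r + t - 1, t)
    return lower, upper


def insertion_count(n: int, t: int, q: int) -> int:
    """Assumed closed form for |I_t(x)|; it depends only on n, t and q."""
    return sum(comb(n + t, i) * (q - 1) ** i for i in range(t + 1))


x, t = "0101", 2
low, high = deletion_bounds(x, t)
print(low, len(deletion_set(x, t)), high)
print(len(insertion_set("01", 1, "01")), insertion_count(2, 1, 2))
```

Note that the insertion count does not depend on x, while the deletion count does, through the number of runs.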

These equations give information about bounds for IDq(n,d) codes. Suppose we have an
IDq(n,d) code C. Then for distinct x, y ∈ C, there is no common subsequence of length n-d. Hence Dd(x)
and Dd(y) are disjoint. Therefore

$$\sum_{x \in C} |D_d(x)| \le q^{n-d} .$$

Known upper bounds are essentially derived
from this observation. Asymptotic lower bounds have been obtained by standard random coding methods
based  on  an  ensemble  generated  by  the  uniform  distribution  for  q-ary  n-sequences  [24].  What  makes 
IDq(n,d) codes (and DNA(n,d) codes below) difficult to analyze can be observed in the following. Since
lcs(x,y) ≥ n - t if and only if y can be obtained from x by t deletions followed by t insertions, it follows
that

$$V_t(x) = \bigcup_{z \in D_t(x)} I_t(z) .$$

This  indicates  the  fact  that  there  are  spheres  of  the  same  radius  but  different  volumes.  From  the 
above  observation  about  insertion-deletion  spheres,  good  codes  should  have  codewords  x  with  Vt(x)  of 
smaller  volume.  This  requires  codewords  x  with  fewer  runs.  One  way  to  create  such  codewords  is  to  use 
a  stationary  Markov  Chain. 
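The dependence of the sphere volume on the number of runs can be seen on a toy case. The following sketch enumerates Vt(x) directly from the condition lcs(x,y) ≥ n - t over a binary alphabet; the helper names are illustrative and the exhaustive enumeration is only feasible for very small n.

```python
from itertools import product


def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sphere(x: str, t: int, alphabet: str) -> set:
    """V_t(x): all sequences y of the same length with lcs(x, y) >= len(x) - t."""
    n = len(x)
    words = ("".join(w) for w in product(alphabet, repeat=n))
    return {y for y in words if lcs_length(x, y) >= n - t}


# One run, two runs, four runs: the sphere volume grows with the number of runs.
for x in ("0000", "0011", "0101"):
    print(x, len(sphere(x, 1, "01")))
```

For the single-run word 0000 the sphere of radius 1 contains exactly the five words with at least three zeros, while words with more runs have strictly larger spheres in this example.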

In comparison to the plethora of codes for the Hamming metric, there are few known
constructions of non-random IDq(n,d) codes and almost all are for IDq(n,1) codes. See [19], [25], [36],
[40], [42]. One non-random method that we have discovered (and have applied in some of our programs
described in Section 4) can be explained as follows. Given an IDq(n,d) code C of size N, an
IDq(nk, kd+k-1) code C(k) of size N can be obtained by replacing every n-sequence in C with the
nk-sequence achieved by repeating each entry k times. For example, if k = 3, then 0112033 would be replaced
by 000111111222000333333.
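The repetition construction is simple enough to state in a few lines. This is a minimal sketch with illustrative function names; `repeated_parameters` just evaluates the parameter map (n, d) to (nk, kd+k-1) stated above.

```python
def repeat_entries(word: str, k: int) -> str:
    """Repeat each entry of the word k times, e.g. '012' -> '001122' for k = 2."""
    return "".join(symbol * k for symbol in word)


def repeat_code(code: list, k: int) -> list:
    """Apply the repetition map to every codeword; the code size is unchanged."""
    return [repeat_entries(word, k) for word in code]


def repeated_parameters(n: int, d: int, k: int) -> tuple:
    """Parameters (length, d) of the repeated code obtained from an ID(n, d) code."""
    return n * k, k * d + k - 1


print(repeat_entries("0112033", 3))  # 000111111222000333333
print(repeated_parameters(7, 2, 3))  # (21, 8)
```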
The reason for choosing the insertion-deletion metric for strand design is that two 3'-5' DNA
sequences x, y of length n can form s bonds in a duplex only if x and the complement of y have a
common subsequence of length s. No other metric constraint has this property.

This is exhibited in FIGURE 3. The unintended duplex r:R can form four base pairings because l
has the subsequence ttga and R has the exact same subsequence TTGA. Recall that l is the complement
of r. Note that these subsequences are not necessarily contiguous. For example, the subsequence ttga of l
is not contiguous. With this in mind, we have the following definition.


3.2  DNA(n,d)  Codes 

Definition 2 [15] A DNA(n,d) code is a collection of DNA strands (3'-5') of length n that satisfy the
following  constraints: 

1.  The  complement  of  every  strand  in  the  collection  is  also  in  the  collection  (i.e.,  the  code  is  closed  under 
complementation.) 

2.  No  strand  is  equal  to  its  complement. 

3.  The  longest  common  subsequence  between  any  two  strands  in  the  collection  is  at  most  n-d-1. 

Ideally, strands from a DNA(n,d) code form n bonds with their complement strands in the
formation of intended hybridized duplexes, while at most n-d-1 bonds occur in any unintended
cross-hybridized duplex. Thus if d is large enough and the reactions are carried out at a temperature that
exceeds the n-d-1 bonding threshold, but is below the n bonding threshold, then cross-hybridization will
be essentially eliminated.
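The three conditions of Definition 2 can be checked mechanically for a small collection of strands. The sketch below is an illustrative checker, not the authors' program: `reverse_complement` implements the Watson-Crick complement for strands written 3'-5', and `is_dna_code` is a hypothetical name.

```python
from itertools import combinations

COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_complement(strand: str) -> str:
    """Watson-Crick complement of a strand, written in the same 3'-5' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def is_dna_code(strands: list, d: int) -> bool:
    """Check conditions 1-3 of Definition 2 for a list of equal-length strands."""
    if not strands:
        return False
    n = len(strands[0])
    strand_set = set(strands)
    if len(strand_set) != len(strands) or any(len(s) != n for s in strands):
        return False
    for s in strand_set:
        rc = reverse_complement(s)
        if rc not in strand_set:   # condition 1: closed under complementation
            return False
        if rc == s:                # condition 2: no self-complementary strand
            return False
    # condition 3: every pair of distinct strands has lcs at most n - d - 1
    return all(lcs_length(x, y) <= n - d - 1 for x, y in combinations(strands, 2))


small = ["aattttaa", "ttaaaatt", "cccccccc", "gggggggg", "aaaaaaaa", "tttttttt"]
print(reverse_complement("gtattgat"))                 # atcaatac
print(is_dna_code(small, 3), is_dna_code(small, 4))   # True False
```

The six strands used here are taken from TABLE 1; they satisfy the constraints for d = 3 but not for d = 4, because several pairs share a common subsequence of length four.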

If we set 0 = A, 1 = T, 2 = C, and 3 = G, then DNA(n,d) codes are a subclass of ID4(n,d) codes
because an ID4(n,d) code need only satisfy condition 3 of a DNA(n,d) code. The reasons for conditions 1
and 2 can be observed in FIGURES 6-9, and are discussed in Section 5. Using this numeric-letter
identification, the reverse complement of a 4-ary sequence is defined exactly as the Watson-Crick
complement. Using this definition, an ID4(n,d) code that also satisfies condition 1 of a DNA(n,d) code is
called an RC(n,d) code.

Asymptotic lower bounds have been obtained for RC(n,d) codes. Let
rc(n,d) be the maximum size of an RC(n,d) code. Clearly rc(n,d) ≤ id4(n,d). Then as n → ∞ we have

rc(n,d)>(d!)2(^)  4l(l  +  o(l)).  (1) 
As was the case for IDq(n,d) codes, this lower bound was achieved by standard random
coding arguments based on an ensemble of codewords generated by using the uniform distribution of the
q-ary symbols.

Let C be a DNA(n,d) code of size 2N. We can partition C into two halves, each half free of the
complement of any other strand in the given half. Let R(C) = {r1, ..., rN} denote one of these halves and
let L(C) = {l1, ..., lN} (where r_i is the complement of l_i) denote the other half. An example of a
DNA(8,3) code partitioned in this way is given below in TABLE 1. Note that for any distinct x, y in this
DNA(8,3) code we have lcs(x,y) ≤ 4. In TABLE 1 all sequences are given 3'-5'.

In  applications  the  subsets  R(C)  and  L(C)  have  complementary  functions.  For  example,  the 
strands  in  R(C)  can  function  as  molecular  tags  or  sites  to  write  on  a  molecule  while  the  strands  in  L(C) 
can  function  as  probes,  extractors  or  site  blockers.  In  Section  5,  we  describe  how  the  strands  in  a 
DNA(n,d) code can be used to construct a biomolecular computing architecture. In that architecture,
R(C) and L(C) each have two distinct functions at different points in the procedure. At one point of the
procedure, r_i is used to write "1" and l_i is used to write "0." At another point, r_i is used to read "0" and
l_i is used to read "1." In FIGURES 6-9, we also indicate why the insertion-deletion distance needs to be
ensured for codewords between and inside each of the halves.


              R(C)                                 L(C)

r1 = aattttaa    r7  = atgcgttg      l1 = ttaaaatt    l7  = caacgcat
r2 = taaccccg    r8  = cccccccc      l2 = cggggtta    l8  = gggggggg
r3 = ttccaagg    r9  = aaaaaaaa      l3 = ccttggaa    l9  = tttttttt
r4 = ggccaatt    r10 = ggtttccc      l4 = aattggcc    l10 = gggaaacc
r5 = gctacggg    r11 = cccctttt      l5 = cccgtagc    l11 = aaaagggg
r6 = gtattgat    r12 = ttttgggg      l6 = atcaatac    l12 = ccccaaaa

TABLE  1 


4.  Results,  Discussion 

4.1  DNA(n,d)  Code  Generation  Programs 

We have programs that generate very good random and pseudo-random DNA(n,d) (IDq(n,d))
codes. One reason we believe that better bounds for IDq(n,d) and DNA(n,d) codes can be obtained
with a Markov chain approach is the great improvement that we achieved by using a Markov
chain to generate codewords for our randomly constructed codes. In [9], a random ID(20,5) code of size
1024 was generated by using a uniform distribution. This code was pruned to a DNA(20,5) code of size 16
by experimental and computational methods.

On a 2.3 GHz Pentium PC, we generated a DNA(20,5) code of size 3038.

Given two n-sequences x, y, all of our programs use a "folklore" dynamic programming
algorithm to find lcs(x,y). This is essentially described in [10]. The complexity of this subroutine is
O(n^2). In [10], an improvement of the "folklore" algorithm is given and we plan to incorporate this in
our future programs.

Our  very  first  program  used  a  (reverse-)complement  cyclic  code  as  an  initial  set  from  which  to 
find  a  DNA(n,d)  code  as  a  subcode.  However,  we  eventually  realized  that  cyclic  codes  have  too  much 
symmetry  to  generate  large-size  DNA  subcodes. 

The next step was based on generating uniformly distributed independent random n-sequences,
but from the discussion in Section 3.1, one can understand that the volumes of the spheres of a fixed radius
centered at many of these n-sequences are too large. We did not get the desired performance here.


Our most recent programs generate candidate n-sequences x in the following way. The value of
x1 is selected from {a,c,g,t} with uniform probability. Then, the remaining entries are generated by a
stationary Markov chain given by transition matrix M_k with parameter k.


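$$M_k = \begin{pmatrix} \frac{k-3}{k} & \frac{1}{k} & \frac{1}{k} & \frac{1}{k} \\ \frac{1}{k} & \frac{k-3}{k} & \frac{1}{k} & \frac{1}{k} \\ \frac{1}{k} & \frac{1}{k} & \frac{k-3}{k} & \frac{1}{k} \\ \frac{1}{k} & \frac{1}{k} & \frac{1}{k} & \frac{k-3}{k} \end{pmatrix}$$

(rows and columns indexed by a, c, g, t)

FIGURE 4

The average number of runs in the codewords of this ensemble is approximately 3n/k (see footnote 2).
Thus higher values of k give sequences with fewer runs. These sequences have low volume spheres of the
desired radius.

2  (q-1)n/k for q ≠ 4.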
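The candidate generation step can be sketched as follows, assuming the transition matrix has (k-3)/k on the diagonal and 1/k elsewhere, so that each entry repeats the previous one with probability (k-3)/k. The names `transition_matrix`, `sample_sequence` and `expected_runs` are illustrative. Under this assumption the exact expected number of runs is 1 + 3(n-1)/k, which is close to 3n/k for long sequences.

```python
import random

ALPHABET = "acgt"


def transition_matrix(k: float) -> list:
    """Assumed form of M_k: (k-3)/k on the diagonal, 1/k off the diagonal (k >= 3)."""
    if k < 3:
        raise ValueError("k must be at least 3 so that all entries are probabilities")
    return [[(k - 3) / k if i == j else 1 / k for j in range(4)] for i in range(4)]


def sample_sequence(n: int, k: float, rng: random.Random) -> str:
    """Draw x1 uniformly, then generate the remaining n - 1 entries from the chain."""
    sequence = [rng.choice(ALPHABET)]
    for _ in range(n - 1):
        last = sequence[-1]
        if rng.random() < (k - 3) / k:
            sequence.append(last)  # stay in the current run
        else:
            # switch to one of the three other bases, each with probability 1/k
            sequence.append(rng.choice([b for b in ALPHABET if b != last]))
    return "".join(sequence)


def expected_runs(n: int, k: float) -> float:
    """One initial run plus one new run for each of the n - 1 possible switches."""
    return 1 + 3 * (n - 1) / k


rng = random.Random(0)
for k in (4, 8, 16):
    print(k, sample_sequence(20, k, rng), expected_runs(20, k))
```

With k = 4 every row of the matrix is uniform, which recovers independent uniformly distributed entries; larger k makes long runs more likely.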
However, the higher the value of k, the fewer sequences are generated. At present, our programs take a
dynamic heuristic approach. These programs start with a high value of k and then check all values of the
permitted length of the longest common subsequence. This continues until a set number of cycles pass
without a new codeword being added. The next value of k is set by finding the next highest value of k for
which a codeword can be added to the growing code. Here dichotomy (bisection) is used. This heuristic has
worked very well and is much better than the uniform codeword generation method. For example, by using
our Markov chain heuristic, we can generate a DNA(15,5) code of size 104. This code is exhibited in TABLE
2 (only R(C) is given; L(C) follows by complementation). Using the uniform distribution method, we could
only generate a DNA(15,5) code of size 14.
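A much simplified version of this kind of random construction is sketched below. It is not the adaptive heuristic described above: k is held fixed, candidates are drawn from the assumed Markov chain (repeat probability (k-3)/k), and a candidate is kept together with its complement only if the longest common subsequence constraint holds against everything already in the code. All names and parameter values are illustrative.

```python
import random

ALPHABET = "acgt"
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_complement(strand: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def lcs_length(x: str, y: str) -> int:
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sample_sequence(n: int, k: float, rng: random.Random) -> str:
    """Markov chain candidate: repeat the previous base with probability (k-3)/k."""
    sequence = [rng.choice(ALPHABET)]
    for _ in range(n - 1):
        last = sequence[-1]
        if rng.random() < (k - 3) / k:
            sequence.append(last)
        else:
            sequence.append(rng.choice([b for b in ALPHABET if b != last]))
    return "".join(sequence)


def greedy_dna_code(n: int, d: int, k: float, attempts: int, seed: int = 0) -> list:
    """Greedily grow a DNA(n, d) code from random candidates with a fixed k."""
    rng = random.Random(seed)
    limit = n - d - 1  # largest permitted lcs between two distinct strands
    code = []
    for _ in range(attempts):
        x = sample_sequence(n, k, rng)
        rc = reverse_complement(x)
        # A strand may not equal its complement or be too close to it.
        if x == rc or lcs_length(x, rc) > limit:
            continue
        # Both the candidate and its complement must be far from every codeword.
        if all(lcs_length(x, c) <= limit and lcs_length(rc, c) <= limit for c in code):
            code.extend([x, rc])
    return code


code = greedy_dna_code(n=8, d=3, k=8, attempts=300, seed=1)
print(len(code), code[:4])
```

The code size reached depends on k, the seed and the number of attempts, so this sketch should not be expected to reproduce the sizes reported above.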


R(C)

ccccccccccccccc   aaccccggattttta   cctttaaggatttcc   tttgggccccagcct
ttttttttttttttt   ttggggaaaaacccc   cttttcccgtaactc   aagtaaggtagcagg
aaaaaaggggggggc   ttttggaaaaatttt   ggtttcccttcggtt   gccgtgggctggaac
gggcccttaaaaaaa   ggggaaaaaggttgg   ttccaaaattaaacc   tctgcaaacaagcag
ttttttccccggggg   ggaatttggggttcc   aaagggggctttacg   gtcctttgtcgcctg
ggggggaaatttttt   ttcggggggcccggg   aaacaactttgggca   tgcctcccgcgattg
aaaaaaaaccctttt   ccaaaaaccccgaaa   ttagtttgaagcttg   acaatcgtatcccga
tttacccccccaaaa   cataattttggccgg   gggtggattcaagca   attactggctggcat
aaaggggcccccccc   acccctttaacctgg   cccattcggccaaca   agaattccatacctt
cccccctttggggcc   acaaaaaaggtcaat   acaggccagtccggg   ttggtcgtctttcac
tttaaaaaaaaaggg   cgggccagggtttaa   ttccaaggggtccaa   gacgacccgataggt
aaagggccggaaaat   cccccaaaaggccct   ggactttagtcaatt   tttgatgggactacg
aatcccccccagggt   gttccgttttaaaat   cgtcggttaggcccc   gagcggtcggtactt

TABLE  2 


5.  Conclusions 

5.1  DNA  Computing  with  DNA(n,d)  Codes 

To give an example of how DNA(n,d) codes can be applied to biomolecular computing, we
discuss the algorithm and architecture in [6]. In [6] a total of 80 distinct strands (40 library encoding, 40
filtering) were used to solve a 20 variable SAT problem. We show that the strands of a DNA(n,d) code can
be used to both encode and filter. In [6], a strand either encodes or filters exclusively. Note that there are
other architectures that can be constructed using DNA(n,d) codes, e.g., the variants of the sticker method [33].

Note that because of the (assumed) properties of our DNA(n,d) code, we need only half as many
distinct strands as were used in [6].

A DNA bit string of length N is a DNA molecule (single long strand) that consists of N distinct
non-overlapping substrands X1, X2, ..., XN and N-1 identical DNA molecules S that are located between
any two consecutive X_i, X_{i+1} in the DNA bit string (see footnote 3). Suppose we have a DNA(n,d) code C
of size 2N partitioned into R(C) and L(C). A Lipton encoding [26] can be used to construct a DNA library of
2^N distinct DNA bit strings X1 S X2 S ... X_{N-1} S XN, where X_i = r_i or X_i = l_i for r_i ∈ R(C),
l_i ∈ L(C). If we think of r_i = 1 and l_i = 0, then we have a library of DNA molecules that encode all
binary sequences of length N. Using the subcode {r1, ..., r7} ∪ {l1, ..., l7} of the above DNA(8,3) code in
TABLE 1, the encoding of the sequence 1000101 is given in FIGURE 5.

3’  aattttaa  S  cggggtta  S  ccttggaa  S  aattggcc  S  gctacggg  S  atcaatac  S  atgcgttg  5’ 

FIGURE  5 
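The encoding in FIGURE 5 can be reproduced with a small sketch. The strands r1, ..., r7 are taken from TABLE 1 and the l_i are computed as their complements; the separator is written as the placeholder symbol "S", and the function names are illustrative.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_complement(strand: str) -> str:
    """Watson-Crick complement of a strand, written in the same 3'-5' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


# r1, ..., r7 from TABLE 1; l_i is the complement of r_i.
R = ["aattttaa", "taaccccg", "ttccaagg", "ggccaatt", "gctacggg", "gtattgat", "atgcgttg"]
L = [reverse_complement(r) for r in R]


def encode(bits: str, separator: str = "S") -> str:
    """Encode a binary sequence as a DNA bit string: bit i is r_i for 1 and l_i for 0."""
    if len(bits) != len(R) or set(bits) - {"0", "1"}:
        raise ValueError("expected a binary sequence of length %d" % len(R))
    words = [R[i] if bit == "1" else L[i] for i, bit in enumerate(bits)]
    return (" " + separator + " ").join(words)


print(encode("1000101"))
# aattttaa S cggggtta S ccttggaa S aattggcc S gctacggg S atcaatac S atgcgttg
```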

As indicated above, we identify DNA bit strings and binary sequences. For I ⊆ [N] and (e_i), i ∈ I, a
binary sequence, let K be the following subset of binary N-sequences:
K = {(b_i) : b_i = e_i for some i ∈ I}. K is the set of all binary sequences that satisfy the disjunctive
clause K' over N Boolean terms, each of which is a variable x_i (if e_i = 1) or its negation ~x_i (if e_i = 0).
The main "computing" idea in [6] is an iteration of the following: Given a subset T of DNA bit strings
and a set K defined above, the subset T ∩ K can be extracted from the set T by hybridization. See
Example 2.

Example 2. Suppose T is the set of all 2^7 DNA bit strings formed by using our DNA(8,3) subcode of
size 14 given above. Suppose K1 = {(b_i) : b1 = 1 or b3 = 0 or b4 = 1 or b5 = 0} and
K2 = {(b_i) : b2 = 1 or b3 = 1 or b7 = 0}. If we use the DNA bit string representation, then
K1 = {(b_i) : b1 = r1 or b3 = l3 or b4 = r4 or b5 = l5} and K2 = {(b_i) : b2 = r2 or b3 = r3 or b7 = l7}.

3  To improve performance, strands of synthetic bases (e.g., iso-G) could be used as separator sequences.

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Two  corresponding  "filters"  F,  and  F2  are  constructed.  Fj  consists  of  the  probe  strands  lj  ,r3  ,14  ,r5 
affixed  to  a  gel.  Note  that  these  are  the  complement  stands  to  those  that  appear  in  Kj .  Thus  Fj  could  be 
called  the  complement  filter  of  Kj.  Similarly,  F2  consists  of  the  strands  12  ,  13,  r7  affixed  to  a  gel. 
When  T  is  passed  through  Fl5  only  the  strands  in  Kj  hybridize  with  the  probes  affixed  in  F,  and  remain 
in  the  gel.  The  strands  that  pass  through  the  filter  Fj  are  discarded.  The  strands  that  remain  in  the  Fj  gel 
are  exactly  TnKj.  These  strands  can  be  "washed"  from  the  filter  F,  and  recovered.  Then  these 
recovered  strands  are  passed  through  filter  F2 .  Only  the  strands  in  TnKj  and  in  K2  hybridize  with 
probes  in  F2 .  What  passes  through  is  discarded.  The  strands  that  remain  in  the  F,  gel  are  exactly 
TnKj  n  K2 .  Thus  the  strands  TnKj  n  K2  in  the  F2  gel  are  all  binary  sequences  that  satisfy  the 
conjunction  K\  a  Ki  of  the  clauses  K\  and  K2 . 
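The two filter passes of Example 2 can be mimicked electronically with ordinary sets. The following is only a sketch under stated assumptions: bit strings are modeled as tuples of 0/1 rather than as DNA strands, hybridization is assumed to be error-free, and the names `in_clause` and `filter_step` are illustrative and do not come from [6].

```python
from itertools import product

N_BITS = 7  # the size-14 subcode supplies 7 (l_j, r_j) pairs, i.e. 7 bits

def in_clause(bits, clause):
    """clause maps a 1-based position j to the bit value that satisfies it."""
    return any(bits[j - 1] == value for j, value in clause.items())

def filter_step(strands, clause):
    """Idealized gel filter: keep only the strands that satisfy the clause."""
    return {b for b in strands if in_clause(b, clause)}

# K1: b1 = 1 or b3 = 0 or b4 = 1 or b5 = 0
# K2: b2 = 1 or b3 = 1 or b7 = 0
K1 = {1: 1, 3: 0, 4: 1, 5: 0}
K2 = {2: 1, 3: 1, 7: 0}

T = set(product((0, 1), repeat=N_BITS))  # all 2^7 bit strings
after_F1 = filter_step(T, K1)            # models T ∩ K1
after_F2 = filter_step(after_F1, K2)     # models T ∩ K1 ∩ K2

print(len(T), len(after_F1), len(after_F2))  # 128 120 104
```

The counts follow from the clauses: 8 strings violate $K_1$, 16 violate $K_2$, and no string violates both because the two violations require different values of $b_3$.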

Given the above descriptions, the general SAT problem can be thought of as: Given disjunctive
clauses $K_1, K_2, \ldots, K_p$, is $\bigcap_{i=1}^{p} K_i \neq \emptyset$? By constructing the corresponding complement filters
$F_1, F_2, \ldots, F_p$ and iterating the above process, the answer is "yes" if and only if there are any strands in
$F_p$.
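As a hedged illustration of the decision procedure, the sketch below passes a full library of bit strings through one idealized filter per clause and reports whether anything survives. It assumes error-free filtering, represents a clause as a list of hypothetical `(position, value)` pairs, and enumerates all $2^n$ strings explicitly, which is precisely the work that the DNA approach is meant to perform in parallel.

```python
from itertools import product

def satisfiable_by_filtering(clauses, n_bits):
    """Pass the full library through F_1, ..., F_p; return (answer, survivors)."""
    strands = set(product((0, 1), repeat=n_bits))
    for clause in clauses:
        # A strand stays in the gel if at least one literal of the clause matches
        strands = {b for b in strands if any(b[j - 1] == v for j, v in clause)}
    return bool(strands), strands

# (b1) and (not b1 or b2) and (not b2): no assignment survives
unsat = [[(1, 1)], [(1, 0), (2, 1)], [(2, 0)]]
# (b1 or b2) and (not b1): only b = (0, 1) survives
sat = [[(1, 1), (2, 1)], [(1, 0)]]

print(satisfiable_by_filtering(unsat, 2))  # (False, set())
print(satisfiable_by_filtering(sat, 2))    # (True, {(0, 1)})
```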

All  of  the  above  analysis  is  contingent  on  avoiding  all  of  the  possible  cross-hybridization 
situations that a DNA(n,d) code intends to avoid. We now give some examples of potential
cross-hybridizations. For the filter to work, we need correct reads. See FIGURE 6.

3’ aattttaa S cggggtta S ccttggaa S aattggcc S gctacggg S atcaatac S atgcgttg 5’

FIGURE 6. Correct read: a correct DNA bit string is affixed.

Incorrect reads are avoided by ensuring that codewords inside of L(C) (R(C)) have the proper
insertion-deletion distance. In FIGURE 7, only four base pairings can form in a "bad read." Here $c_5$ is
"incorrectly reading" $r_7$. Four bonds can form because $\mathrm{lcs}(r_5, r_7) = 4$. The common subsequence
between $r_5$ = gctacggg and $r_7$ = atgcgttg is tcgg.

3’ aattttaa S cggggtta S ccttggaa S aattggcc S gctacggg S atcaatac S atgcgttg 5’

FIGURE 7. Incorrect read: an incorrect DNA bit string is affixed.

Inter-DNA bit string interactions are prevented by ensuring that the insertion-deletion distance
inside of R(C) (L(C)) or between R(C) and L(C) is sufficient. In the top pair in FIGURE 8, only four bonds
can form between the two strands at the indicated positions because $\mathrm{lcs}(l_2, r_5) = 4$. The common
subsequence between $l_2$ = cggggtta and $r_5$ = gctacggg in FIGURE 8 is gggg. Similarly, ensuring the proper
insertion-deletion distance prevents intra-DNA (hairpin) interactions. See FIGURE 9.
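The lcs values quoted for the codewords can be checked with a standard dynamic program for the longest common subsequence. The sketch below is illustrative only: `risky_pairs` and its `max_lcs` threshold are hypothetical names for a simple pairwise screen, and the code reproduces only the lcs values stated in the text, not the full hybridization model behind a DNA(n,d) code.

```python
from itertools import combinations

def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def risky_pairs(words, max_lcs):
    """Pairs of words whose LCS exceeds max_lcs (a hypothetical design threshold)."""
    scored = ((x, y, lcs_length(x, y)) for x, y in combinations(words, 2))
    return [t for t in scored if t[2] > max_lcs]

# Codewords as they are spelled in the text and in FIGURE 6
l2, r5, r7 = "cggggtta", "gctacggg", "atgcgttg"

print(lcs_length(r5, r7), lcs_length(l2, r5))  # 4 4
print(risky_pairs([l2, r5, r7], max_lcs=4))    # []
```

A screen of this kind costs on the order of $n^2$ operations per pair of length-n words, so it is cheap to run over every pair in a candidate code before synthesis.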

3’ xxx S taaccccg S xxx S xxx S xxx S xxx S xxx 5’

3’ xxx S xxx S xxx S xxx S xxx S xxx S atgcgttg 5’

FIGURE 8. DNA bit string interactions. These DNA bit strings could be prevented from being affixed.


5.2  DNA  Computing  and  Data  Mining 


We now discuss a problem that is of particular interest to us. Let $2^{[N]}$ denote the power set of
$[N]$. Suppose we have a DNA(n,d) code C of size 2N partitioned into R(C) and L(C).

Problem 1. Let $P_1, P_2, \ldots, P_m$ be fixed subsets of $[N]$.

a. Find all $S \subseteq [N]$ with $S \not\subseteq P_i$ for all i with $1 \le i \le m$.

b. Find all $T \subseteq [N]$ with $P_i \not\subseteq T$ for all i with $1 \le i \le m$.

Both  of  these  problems  are  related  and  are  simplified  forms  of  the  general  SAT  problem.  They 
can  be  solved  by  the  method  described  above.  (These  are  simplifications  because  no  negations  appear  in 
the  clauses.) 

There  is  one  important  difference.  In  the  SAT  problem,  only  one  solution  needs  to  be  found.  Here 
all  solutions  are  required. 

Let $(b_j)$ be a binary N-sequence. As above, let $K_i = \{(b_j) : b_j = 1 \text{ for some } j \notin P_i\}$. Clearly the set
of all S with $S \not\subseteq P_i$ for all i with $1 \le i \le m$ is the set of all S with incidence vector in
$\bigcap_{i=1}^{m} K_i$. In the DNA bit string representation, $K_i = \{(b_j) : b_j = r_j \text{ for some } j \notin P_i\}$. The
associated filter $F_i$ consists of $\{l_j : j \notin P_i\}$. If a set S of DNA bit strings of length N is passed
through $F_i$, then only the bit strings in $K_i$ remain in the gel of $F_i$. Starting with all possible DNA bit
strings and iterating the filter process outlined above m times, we arrive at $F_m$. $F_m$ contains all the
DNA bit string representations of the solutions to Problem 1a. Problem 1b can be transformed into
Problem 1a because $P_i \not\subseteq T$ if and only if $[N] - T \not\subseteq [N] - P_i$.
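The filter iteration for Problem 1a, and the complement trick that reduces Problem 1b to it, can be sketched in a few lines. This is an idealized, brute-force model: subsets stand in for DNA bit strings, each pass is assumed error-free, and the function names are illustrative assumptions rather than part of the original method.

```python
from itertools import combinations

def all_subsets(n):
    """All subsets of [n] = {1, ..., n} as frozensets."""
    items = range(1, n + 1)
    return [frozenset(c) for k in range(n + 1) for c in combinations(items, k)]

def problem_1a(n, P):
    """All S in [n] that are not contained in any P_i, one filter pass per P_i."""
    survivors = all_subsets(n)
    for P_i in P:
        # Filter F_i keeps S only if S has some element j outside P_i
        survivors = [S for S in survivors if S - P_i]
    return set(survivors)

def problem_1b(n, P):
    """All T that contain no P_i, obtained by complementing and reusing 1a."""
    universe = frozenset(range(1, n + 1))
    complements = [universe - P_i for P_i in P]
    return {universe - S for S in problem_1a(n, complements)}

P = [frozenset({1, 2}), frozenset({2, 3})]
print(sorted(map(sorted, problem_1a(3, P))))  # [[1, 2, 3], [1, 3]]
print(sorted(map(sorted, problem_1b(3, P))))  # [[], [1], [1, 3], [2], [3]]
```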

The most straightforward application of the above problem is in the identification of independent
sets in a graph (or hypergraph). In Problem 1b, if one takes all the edges of a simple graph G as the
collection $\{P_i\}$, then the set of all T is the collection of independent sets in G.
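As a small worked instance of this correspondence, the sketch below lists the independent sets of a 4-cycle by discarding, one edge at a time, every vertex subset that contains both endpoints of that edge. The graph, the function name, and the direct set-based filtering are illustrative assumptions; the sketch does not model the hybridization chemistry.

```python
from itertools import combinations

def independent_sets(vertices, edges):
    """Vertex subsets T that contain no edge (Problem 1b with the edges as the P_i)."""
    vertices = sorted(vertices)
    candidates = [frozenset(c) for k in range(len(vertices) + 1)
                  for c in combinations(vertices, k)]
    for u, v in edges:
        # One pass per edge: discard every T that contains both endpoints
        candidates = [T for T in candidates if not {u, v} <= T]
    return set(candidates)

cycle4 = [(1, 2), (2, 3), (3, 4), (4, 1)]
result = independent_sets({1, 2, 3, 4}, cycle4)
print(sorted(map(sorted, result)))
# [[], [1], [1, 3], [2], [2, 4], [3], [4]]
```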

Problem 1a can be applied to the identification of maximal frequent sets in a database [12], [18],
[27], [28], [29], [30]. This is of interest to us because the identification of the maximal
frequent sets is the main computational bottleneck in the data mining of association rules.

The relationship between data mining and Problem 1 is this. If the sets $\{P_i\}$ are selected properly,
the subsets S will be candidates for maximal frequent sets. See [12], [18], [27], [28], [29], [30] and
Section 12 here. The application of DNA computing is to apply to Problem 1 an algorithm like that
described in Section 7.

FIGURE 10

Then the resulting collection of DNA library strands that remain in the final filter will code for
maximal frequent sets. What is envisioned is that patterns in data are being transformed into molecules.
See FIGURE 10. If we can read the molecules, then we have found the patterns.